Copy to BDA enables you to copy tables from an Oracle database into Hadoop. After generating Data Pump format files from the tables and copying the files to HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to BDA is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.
Copy to BDA is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license to use Copy to BDA.
Copy to BDA is available only on Oracle Exadata Database Machine connected to Oracle Big Data Appliance.
Oracle Exadata Database Machine must comply with the following requirements:
Configured on the same InfiniBand or client network as Oracle Big Data Appliance. Oracle recommends an InfiniBand connection between Oracle Exadata Database Machine and Oracle Big Data Appliance, but it is not required.
Runs Oracle Database 11.2 or later. (Copy to BDA supports earlier Oracle Database releases than Oracle Big Data SQL does.)
Copy to BDA is a component of Oracle Big Data SQL, which is an installation option on Oracle Big Data Appliance. You can enable Oracle Big Data SQL either during the initial software installation or at a later time using the standard methods for enabling and disabling services. See "Performing the Installation".
Copy to BDA requires only a Hadoop client on Oracle Exadata Database Machine. It does not use the additional software that Oracle Big Data SQL requires.
If you also plan to use Oracle Big Data SQL, then the Hadoop client is created automatically when you run the bds-exa-install.sh installation script. You do not need to take any additional steps. See "Running the Post-Installation Script for Oracle Big Data SQL".
If you do not plan to use Oracle Big Data SQL at this time, then you can install the Hadoop client manually instead of running the script. For example, Oracle Big Data SQL might not be supported with the version of Oracle Database you are using. Or you might want to avoid installing a database patch and other software that you do not currently need. See "Providing Remote Client Access to CDH".
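After a manual installation, you can confirm that the Hadoop client works with a couple of basic commands. This is only a quick sanity check, and the HDFS path shown is an example:

# Confirm that the client software runs and reports its version.
$ hadoop version

# Confirm that the client can reach HDFS on Oracle Big Data Appliance.
# (/user/oracle is an example path; use a directory you can read.)
$ hadoop fs -ls /user/oracle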
The SQL CREATE TABLE statement has a clause specifically for creating external tables, in which you specify the ORACLE_DATAPUMP access driver. The information that you provide in this clause enables the access driver to generate a Data Pump format file that contains the data and metadata from the Oracle database table.
Data Pump files are typically used to move data and metadata from one database to another. Copy to BDA uses this file format to copy data from an Oracle database to HDFS.
To generate Data Pump format files, you create an external table from an existing Oracle table. An external table in Oracle Database is an object that identifies and describes the location of data outside of a database. External tables use access drivers to parse and format the data. For Copy to BDA, you use the ORACLE_DATAPUMP access driver. It copies the data and metadata from internal Oracle tables and populates the Data Pump format files of the external table.
You must have read and write access to a database directory in Oracle Database. Only Oracle Database users with the CREATE ANY DIRECTORY system privilege can create directories.

This example creates a database directory named EXPORTDIR that points to the /exportdir directory on Oracle Exadata Database Machine:
SQL> CREATE DIRECTORY exportdir AS '/exportdir';
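If a DBA creates the directory for you, then the DBA must also grant you access to it. For example, assuming a hypothetical user named SCOTT:

SQL> GRANT READ, WRITE ON DIRECTORY exportdir TO scott;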
The following is the basic syntax of the CREATE TABLE statement for Data Pump format files:
CREATE TABLE table_name
   ORGANIZATION EXTERNAL (
      TYPE oracle_datapump
      DEFAULT DIRECTORY database_directory
      LOCATION ('filename1.dmp','filename2.dmp'...)
   )
   PARALLEL n
   AS SELECT * FROM tablename;
DEFAULT DIRECTORY
Identifies the database directory that you created for this purpose. See "Identifying the Target Directory".

LOCATION
Lists the names of the Data Pump files to be created. The number of names should match the degree of parallelism (DOP) specified by the PARALLEL clause; otherwise, the DOP drops to the number of files. The number of files and the degree of parallelism affect the performance of Oracle Database when generating the Data Pump format files. They do not affect querying performance in Hive. See the sketch following these descriptions.

PARALLEL
Sets the degree of parallelism (DOP). Use the maximum number that your Oracle DBA permits you to use. By default the DOP is 1, which is serial processing. Larger numbers enable parallel processing.

AS SELECT
Use the full SQL SELECT syntax for this clause. It is not restricted. The tablename identifies the Oracle table to be copied to HDFS.
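For instance, this sketch exports a hypothetical SALES table at a DOP of 4; the table and file names are illustrative only:

CREATE TABLE export_sales
   ORGANIZATION EXTERNAL (
      TYPE oracle_datapump
      DEFAULT DIRECTORY exportdir
      -- Four file names to match PARALLEL 4; with fewer names,
      -- the DOP silently drops to the number of files.
      LOCATION ('sales1.dmp','sales2.dmp','sales3.dmp','sales4.dmp')
   )
   PARALLEL 4
   AS SELECT * FROM sales;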
See Also: Oracle Database SQL Language Reference and Oracle Database Utilities for descriptions of these parameters.
The Oracle Big Data SQL installation places Hadoop client files on Oracle Exadata Database Machine. The Hadoop client installation enables you to use Hadoop commands to copy the Data Pump files to HDFS. You must have write privileges on the HDFS directory.
To copy the dmp files into HDFS, use the hadoop fs -put command. This example copies the files into the HDFS customers directory owned by the oracle user:
$ hadoop fs -put customers*.dmp /user/oracle/customers
To provide access to the data in the Data Pump files, you create a Hive external table over the Data Pump files. Copy to BDA provides SerDes that enable Hive to read the files. These SerDes are read only, so you cannot use them to write to the files.
See Also: Apache Hive Language Manual DDL at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
For external tables, Hive loads the table metadata into its metastore. The data remains in its original location, which you identify in the LOCATION clause. If you drop an external table using a HiveQL DROP TABLE statement, then only the metadata is discarded, while the external data remains unchanged. In this respect, Hive handles external tables in fundamentally the same way as Oracle Database.
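A brief sketch of this behavior, assuming the customers Hive table and HDFS directory created later in this chapter:

hive> DROP TABLE customers;               -- discards only the Hive metadata

$ hadoop fs -ls /user/oracle/customers    -- the .dmp files remain in HDFS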
External tables support data sources that are shared by multiple programs. In this case, you use Oracle Database to update the data and then generate a new file. You can overwrite the old HDFS files with the updated files while leaving the Hive metadata intact.
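One possible refresh sequence, assuming the EXPORT_CUSTOMERS external table and the file names used elsewhere in this chapter:

SQL> DROP TABLE export_customers;    -- drop the old export definition

$ rm /expdir/customers.dmp           -- the access driver does not overwrite an existing file

SQL> CREATE TABLE export_customers
        ORGANIZATION EXTERNAL (
           TYPE oracle_datapump
           DEFAULT DIRECTORY expdir
           LOCATION('customers.dmp')
        )
        AS SELECT * FROM customers;  -- regenerate the file with current data

$ hadoop fs -rm /user/oracle/customers/customers.dmp
$ hadoop fs -put /expdir/customers.dmp /user/oracle/customers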
The following is the basic syntax of a Hive CREATE TABLE statement for creating a Hive external table for use with a Data Pump format file:
CREATE EXTERNAL TABLE tablename
ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs_directory'
The Hive table columns automatically have the same names as the Oracle columns, which are provided by the metadata stored in the Data Pump files. In this release, any user-specified column definitions are ignored.
Copy to BDA automatically converts the data in an Oracle table to an appropriate Hive data type. Table 8-1 shows the default mappings between Oracle and Hive data types.
Table 8-1 Oracle to Hive Data Type Conversions
Oracle Data Type | Hive Data Type
---|---
NUMBER | INT when the scale is 0 and the precision is less than 10; BIGINT when the scale is 0 and the precision is less than 19; DECIMAL when the scale is greater than 0 or the precision is greater than 19
CLOB, NCLOB | STRING
BINARY_DOUBLE | DOUBLE
BINARY_FLOAT | FLOAT
BLOB | BINARY
CHAR, NCHAR | CHAR
VARCHAR2, NVARCHAR2 | VARCHAR
ROWID, UROWID | BINARY
DATE | TIMESTAMP
TIMESTAMP | TIMESTAMP
TIMESTAMPTZ, TIMESTAMPLTZ | Unsupported (see the note on time zone data at the end of this chapter)
RAW | BINARY
This example shows all steps in the process of creating a Hive table from an Oracle table using Copy to BDA.
The Oracle tables are from the Sales History (SH) sample schema. The CUSTOMERS table provides extensive information about individual customers, including names, addresses, telephone numbers, birth dates, and credit limits. The COUNTRIES table provides a list of countries, and identifies regions and subregions.

This query shows a small selection of data in the CUSTOMERS table:
SELECT cust_first_name first_name,
       cust_last_name last_name,
       cust_gender gender,
       cust_year_of_birth birth
FROM customers
ORDER BY cust_city, last_name
FETCH FIRST 10 ROWS ONLY;
The query returns the following rows:
FIRST_NAME      LAST_NAME            GENDER BIRTH
--------------- -------------------- ------ ----------
Lise            Abbey                F      1963
Lotus           Alden                M      1958
Emmanuel        Aubrey               M      1933
Phil            Ball                 M      1956
Valentina       Bardwell             F      1965
Lolita          Barkley              F      1966
Heloise         Barnes               M      1980
Royden          Barrett              M      1937
Gilbert         Braun                M      1984
Portia          Capp                 F      1948
To reproduce this example, install the sample schemas in Oracle Database and connect as the SH user.
See Also: Oracle Database Sample Schemas for descriptions of the tables and installation instructions for the schemas.
These SQL statements create a local database directory named EXPDIR and grant access to the SH user:
SQL> CREATE DIRECTORY expdir AS '/expdir';

Directory created.

SQL> GRANT READ, WRITE ON DIRECTORY expdir TO SH;

Grant succeeded.
The following examples show how to create the Data Pump files and check their contents.
Copy to BDA supports only the syntax shown in the examples. Data Pump files created with the Export utility or Oracle Data Pump are not compatible.
This example shows a very simple SQL command for creating a Data Pump format file from the CUSTOMERS table. It selects the entire table and generates a single output file named customers.dmp in the local /expdir directory.
CREATE TABLE export_customers
   ORGANIZATION EXTERNAL (
      TYPE oracle_datapump
      DEFAULT DIRECTORY expdir
      LOCATION('customers.dmp')
   )
   AS SELECT * FROM customers;
The next example shows more complexity in the syntax. It joins the CUSTOMERS and COUNTRIES tables on the COUNTRY_ID columns to provide the country names. It also limits the rows to customers in the Americas. The command generates two output files in parallel, named americas1.dmp and americas2.dmp, in the local /expdir directory.
CREATE TABLE export_americas
   ORGANIZATION EXTERNAL (
      TYPE oracle_datapump
      DEFAULT DIRECTORY expdir
      LOCATION('americas1.dmp', 'americas2.dmp')
   )
   PARALLEL 2
   AS SELECT a.cust_first_name first_name,
             a.cust_last_name last_name,
             a.cust_gender gender,
             a.cust_year_of_birth birth,
             a.cust_email email,
             a.cust_postal_code postal_code,
             b.country_name country
      FROM customers a, countries b
      WHERE a.country_id = b.country_id
        AND b.country_region = 'Americas'
      ORDER BY a.country_id, a.cust_postal_code;
You can check the contents of the output data files before copying them to Hadoop. The previous CREATE TABLE statement created an external table named EXPORT_AMERICAS, which you can describe and query like any other table.
The DESCRIBE statement shows the selection of columns and the modified names:
SQL> DESCRIBE export_americas;
Name Null? Type
------------------------- -------- -----------------
FIRST_NAME NOT NULL VARCHAR2(20)
LAST_NAME NOT NULL VARCHAR2(40)
GENDER NOT NULL CHAR(1)
BIRTH NOT NULL NUMBER(4)
EMAIL VARCHAR2(50)
POSTAL_CODE NOT NULL VARCHAR2(10)
COUNTRY NOT NULL VARCHAR2(40)
A SELECT statement like the following shows a sample of the data:
SELECT first_name, last_name, gender, birth, country
FROM export_americas
WHERE birth > 1985
ORDER BY last_name
FETCH FIRST 5 ROWS ONLY;
FIRST_NAME      LAST_NAME            GENDER BIRTH      COUNTRY
--------------- -------------------- ------ ---------- ------------------------
Opal            Aaron                M      1990       United States of America
KaKit           Abeles               M      1986       United States of America
Mitchel         Alambarati           M      1987       Canada
Jade            Anderson             M      1986       United States of America
Roderica        Austin               M      1986       United States of America
The following commands list the files in the local expdir directory, create a Hadoop subdirectory named customers, and copy the files to it. The user is connected to Oracle Big Data Appliance as the oracle file system user.
$ cd /expdir
$ ls americas*.dmp
americas1.dmp  americas2.dmp
$ hadoop fs -mkdir customers
$ hadoop fs -put *.dmp customers
$ hadoop fs -ls customers
Found 2 items
-rw-r--r--   1 oracle oracle     798720 2014-10-13 17:04 customers/americas1.dmp
-rw-r--r--   1 oracle oracle     954368 2014-10-13 17:04 customers/americas2.dmp
This HiveQL statement creates an external table using the Copy to BDA SerDes. The LOCATION clause identifies the full path to the Hadoop directory containing the Data Pump files:
CREATE EXTERNAL TABLE customers
ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS
   INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/oracle/customers';
The DESCRIBE command shows the columns of the CUSTOMERS external table:
hive> DESCRIBE customers;
OK
first_name varchar(20) from deserializer
last_name varchar(40) from deserializer
gender char(1) from deserializer
birth int from deserializer
email varchar(50) from deserializer
postal_code varchar(10) from deserializer
country varchar(40) from deserializer
The following HiveQL SELECT statement shows the same data as the SQL SELECT statement from Oracle Database shown in "Verifying the Contents of the Data Files". The two queries access copies of the same Data Pump files.
SELECT first_name, last_name, gender, birth, country
FROM customers
WHERE birth > 1985
ORDER BY last_name
LIMIT 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
.
.
.
OK
Opal       Aaron        M   1990   United States of America
KaKit      Abeles       M   1986   United States of America
Mitchel    Alambarati   M   1987   Canada
Jade       Anderson     M   1986   United States of America
Roderica   Austin       M   1986   United States of America
To copy TIMESTAMPTZ and TIMESTAMPLTZ data to Hive, cast the columns to TIMESTAMP when exporting them to the Data Pump files. Hive does not have a data type that supports time zones or time offsets.
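For example, a hypothetical ORDERS table with an ORDER_DATE column of type TIMESTAMP WITH TIME ZONE could be exported like this; the table and column names are illustrative only:

CREATE TABLE export_orders
   ORGANIZATION EXTERNAL (
      TYPE oracle_datapump
      DEFAULT DIRECTORY expdir
      LOCATION('orders.dmp')
   )
   AS SELECT order_id,
             -- The cast drops the time zone and stores plain timestamps,
             -- which map to the Hive TIMESTAMP type.
             CAST(order_date AS TIMESTAMP) order_date
      FROM orders;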